Word embeddings

Inspired by http://creatingdata.us/etc/streets/ we try to compute word vectors for each streetname and then visualize the relationship between streetnames.

In [1]:
import pandas as pd
from gensim.models import Word2Vec
import numpy as np
import multiprocessing
from umap import UMAP
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

Creating the Word2Vec models

Streets grouped by postcode

Reading the csv file of all streets into memory, this can be generated by the filter script. This assumes you have the file saved in the following folder

In [2]:
streets = pd.read_csv('../data/streets.csv')
# Group streets by postcode
groups = streets.groupby('postcode')

To train the Word2Vec model we normally need a list of sentences, usually mined from some text source. One way to emulate this with streetnames we consider each postcode a different sentence and list all the streetnames for a particular postcode.

In [3]:
# Create a list of lists containing the streetnames of each city
cleaned = []
for m_id, values in groups:
    city = []
    for nl, fr in values[['streetname_nl', 'streetname_fr']].values:
        # try to add nl and fr names, handling the cases where they are null
        try:
            city.append(nl.lower())
        except:
            pass
        try:
            city.append(fr.lower())
        except:
            pass
    cleaned.append(city)

Now we can calculate the word vectors for each streetname. First we create our model. Notable parameters are the min_count stating that only when a street occurs at least twice it can be included in the word vectors and the window which specifies how far words can be from eachother to still be associated with eachother (10 is the maximum value).

In [4]:
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
postcode_grouped = Word2Vec(min_count=2,
                     window=10,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)
# Build vocabulary, dropping streets that only occur once
postcode_grouped.build_vocab(cleaned, progress_per=10000)
# Train the word vectors
postcode_grouped.train(cleaned, total_examples=postcode_grouped.corpus_count, epochs=30, report_delay=1)
# Optimize model
postcode_grouped.init_sims(replace=True)

This model is able to extract some of the associations between the occurrences of streets in different cities. Especially when looking at commonly occuring streetnames we can see some other streets in the same style.

In [5]:
postcode_grouped.wv.most_similar(positive=["dorpsstraat"])
Out[5]:
[('waterstraat', 0.9999365210533142),
 ('kortestraat', 0.9999345541000366),
 ('lindestraat', 0.9999336004257202),
 ('heidestraat', 0.9999011158943176),
 ('eikenlaan', 0.9998989105224609),
 ('markt', 0.9998941421508789),
 ('vaartstraat', 0.9998708963394165),
 ('rozenstraat', 0.999870777130127),
 ('pastorijstraat', 0.9998698830604553),
 ('kasteeldreef', 0.999805212020874)]

Streets grouped on geolocation

We can group streets using geolocation as well (the geolocation of the addresses belonging to that street). We collect the streets in bins and make sure there is overlap between bins so neighboorhoods should be contained in at least one bin. Each bin is then interpreted as a 'sentence' for the creation of the word vectors.

Read the addresses csv file, this assumes you have the file saved in the following folder

In [6]:
addresses = pd.read_csv('../data/belgium_addresses.csv')
In [7]:
min_x = addresses['EPSG:31370_x'].min()
max_x = addresses['EPSG:31370_x'].max()
min_y = addresses['EPSG:31370_y'].min()
max_y = addresses['EPSG:31370_y'].max()
In [8]:
binsize = 1000
coll = {}
# Get the ids of the necessary columns
x_id = addresses.columns.get_loc('EPSG:31370_x')
y_id = addresses.columns.get_loc('EPSG:31370_y')
nl_id = addresses.columns.get_loc('streetname_nl')
fr_id = addresses.columns.get_loc('streetname_fr')

for row in addresses.values:
    # Get the bin offsets
    x = (row[0] // binsize) * binsize
    y = (row[1] // binsize) * binsize
    bins = [
        (x, y),
        (x + binsize/2, y),
        (x, y + binsize/2),
        (x + binsize/2, y + binsize/2),
    ]
    for pos in bins:
        if pos not in coll:
            coll[pos] = set()
        try:
            coll[pos].add(row[nl_id].lower())
        except:
            pass
        try:
            coll[pos].add(row[fr_id].lower())
        except:
            pass
        
blocks = [list(el) for el in coll.values() if el]
In [9]:
geo_grouped = Word2Vec(min_count=10,
                     window=10,
                     size=300,
                     sample=6e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=cores-1)

# Build vocabulary
geo_grouped.build_vocab(blocks, progress_per=10000)
# Train the word vectors
geo_grouped.train(blocks, total_examples=geo_grouped.corpus_count, epochs=30, report_delay=1)
# Optimize the model
geo_grouped.init_sims(replace=True)
In [10]:
geo_grouped.most_similar(positive=["dorpsstraat"])
Out[10]:
[('kruisstraat', 0.6116258502006531),
 ('kewithdreef', 0.5953519940376282),
 ('kerkhofstraat', 0.5736434459686279),
 ('stationsstraat', 0.5568470358848572),
 ('kortestraat', 0.5566809177398682),
 ('oude kerkstraat', 0.550558865070343),
 ('zandstraat', 0.5504564642906189),
 ('kloosterstraat', 0.5420650243759155),
 ('patershof', 0.54120272397995),
 ('dorpsplein', 0.5305353403091431)]

Visualizing the word vectors

The models generate word vectors with 300 dimensions, this kind of data is not easily visualized. To solve this we can apply dimensionality reduction, using the UMAP algorithm, to reduce the amount of dimensions to two which can then be visualized in a scatterplot

In [11]:
reducer = UMAP()

Postcode groups

In [12]:
# Extract the vectors from the model
vectors = []
for word in postcode_grouped.wv.vocab:
    vectors.append(postcode_grouped.wv[word])

vectors = np.array(vectors)

# Create the low dimensional embedding
embedding = reducer.fit_transform(vectors)
In [13]:
plt.figure(figsize=(14, 9))
sns.scatterplot(embedding[:, 0], embedding[:, 1])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efd1474b588>

Geo groups

In [14]:
# Extract the vectors from the model
vectors = []
for word in geo_grouped.wv.vocab:
    vectors.append(geo_grouped.wv[word])

vectors = np.array(vectors)

# Create the low dimensional embedding
embedding = reducer.fit_transform(vectors)
In [15]:
plt.figure(figsize=(14, 9))
sns.scatterplot(embedding[:, 0], embedding[:, 1])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7efc1c4d89e8>